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Abstract. 

There are a number of different phenomena in the early universe that have to be stud- 
ied numerically with lattice simulations. This paper presents a graphics processing unit 
(GPU) accelerated Python program called PyCOOL that solves the evolution of scalar fields 
in a lattice with very precise symplectic integrators. The program has been written with 
the intention to hit a sweet spot of speed, accuracy and user friendliness. This has been 
achieved by using the Python language with the PyCUDA interface to make a program 
that is easy to adapt to different scalar field models. In this paper we derive the symplec- 
tic dynamics that govern the evolution of the system and then present the implementation 
of the program in Python and PyCUDA. The functionality of the program is tested in a 
chaotic inflation preheating model, a single field oscillon case and in a supersymmetric cur- 
vaton model which leads to Q-ball production. We have also compared the performance of 
a consumer graphics card to a professional Tesla compute card in these simulations. We 
find that the program is not only accurate but also very fast. To further increase the use- 
fulness of the program we have equipped it with numerous post-processing functions that 
provide useful information about the cosmological model. These include various spectra 
and statistics of the fields. The program can be additionally used to calculate the gener- 
ated curvature perturbation. The program is publicly available under GNU General Public 
License at https://github.com/jtksai/PyCOOL. Some additional information can be found 
from http : / / www. physics .utu. fi / tiedostot / theory / particlecosmology / pycool / . 
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1 Introduction 

Recent developments in computer technology have made graphic processing units (GPUs) 
an increasingly popular way to solve computationally difficult problems [1-21]. The shear 
floating point power of GPUs is however useless if it cannot be harnessed effectively. The 
use of a suitable language is therefore of paramount importance. During the last years 
the programming of modern GPUs has thankfully evolved drastically and it has become 
relatively easy. NVIDIA's CUDA architecture [22] and the OpenCL language [23] of the 
Khronos Group are very similar to C syntax and therefore lower considerably the learning 
curve needed to write programs that use GPUs as co-processors. In addition to these there 
are now multiple other publicly available computer languages that can be used to call GPUs 
directly or through one of the two aforementioned languages. PyCUDA [24] is an interesting 
and a powerful interface to NVIDIA's CUDA that is based on the Python scripting language. 
Together with its textual templating functionality it allows to easily modify the CUDA kernels 
which make it possible to make highly dynamical GPGPU programs that can be changed 
even at runtime. 

Cosmology offers many computationally demanding problems [25-27] that can be used 
to test and to develop current theories of particle physics. The preheating phase of the post- 
inflationary universe is an interesting subject that has been studied in a myriad of papers 
(see for example [46]). The non-linear dynamics that govern the evolution of scalar fields 
during this process have to be solved numerically. Fortunately there are a number of publicly 
available programs that designed for this. These include LATTICEEASY [28], DEFROST 
[30], PSpectRe [33] and HLATTICE [34] which are all C, C++ or Fortran based programs 
that use CPU(s) to evolve the scalar fields. 

The lattice calculations are relatively easily parallelizable and therefore suitable for also 
GPU computing. The only program to accomplish this up to this point has been CUD- 
AEASY [32] that was in essence an upgraded version of LATTICEEASY retrofitted to the 
specifications and demands of the GPU. Although it was shown to be much faster than the 
rest of the preheating codes it had many shortcomings that made the use of it unappealing. 
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These include the fact that in order to change to a different scalar field model the CUDA 
kernels would have to be edited significantly. The program was also limited to single precision 
accuracy which can be a limiting factor in the simulations of the early universe. 

Instead of only simply improving the existing code we decided to create a completely 
new program from scratch that takes the best parts of CUDAEASY and eradicates all of the 
known shortcomings. We call this program PyCOOL. The user friendliness of the software 
was made a priority in the development process in order to make a system that allows the 
user to concentrate on the physics instead of rewriting code. This has been accomplished 
by writing an extensive object-oriented Python program that uses PyCUDA to create the 
necessary CUDA kernels needed in the evolution steps and in the post-processing phase. 

The integration system has also been changed completely. Whereas CUDAEASY used 
a second-order leap-frog method to evolve the fields PyCOOL is based on an entirely new 
symplectic method developed by A. Frolov and Z. Huang [31] which is also somewhat related 
to the integrator used in HLATTICE [34] . This symplectic algorithm is found to be especially 
suitable for the GPU since it does not need any auxiliary fields and hence has much smaller 
memory footprint compared for example to an ordinary fourth order Runge-Kutta algorithm. 

One of the weaknesses of CUDAEASY was also the fact that it could only solve the 
evolution of the system. However when analyzing the simulation results variables derived 
from the simulation data provide often more valuable information about the system. We 
have therefore included numerous post processing functions in the program that can calculate 
various spectra used in LATTICEEASY and Defrost programs, statistics and effective masses 
of the fields. All of the results are currently stored in SILO files which can be easily visualized 
in the Visit software. Many of the variables are also written in csv (comma separated 
values) format in case other visualization programs are preferred. On top of these we have 
also incorporated the ability to calculate curvature perturbations generated in during the 
evolution of the system. 

The goal of this paper is to derive the symplectic dynamics that govern the evolution of 
the system and then to present the implementation of the program in Python and PyCUDA. 
It is organized as follows. In section II we present and derive the symplectic equations of 
motion in the linear and non- linear cases. We also present the symplectic integration methods 
used and the implementation of these integrators in PyCUDA. In section III we show the 
numerical results in chaotic inflation, oscillon and curvaton models. We conclude with a 
discussion in section IV. 

2 Symplectic Dynamics 

2.1 Discretization of the System 

We will start by presenting the symplectic method of solving evolution of the system. Sym- 
plectic integrators are often used when the conservation of a physical quantity is paramount 
to the validity of the solutions: for example in the case of celestial mechanics long time 
integrations of the solar system are often done with different symplectic integrators. 

The symplectic method of solving the dynamics during preheating was first developed by 
A. Frolov and Z. Huang [31] and we will follow their method closely. Instead of discretizing 
the equations of motion the starting point is to first discretize the action of the system 
spatially and then to derive the Hamiltonian equations of motion from this discrete action by 
variation. This method follows closely the principles used in different variational integrators. 
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The action from which the equations of motion are derived in general relativity is 

where 1Z is the Ricci curvature scalar and C m is the Lagrange function of the matter content 
of the universe. The preheating phase of the universe is often modeled with semi-classical 
scalar fields that are minimally coupled to gravity and interact with each other through a 
potential term. By demanding Lorentz invariance C m is compelled to be a function of only 
the fields and their temporal and spatial first derivatives and action (2.1) then reads 

s= J ^_^ n + Y^-d^ i d»<t> i -V{<t>x,...,<t> N ))^gd A x, (2.2) 

i 

where index i = 1,...,N labels the different fields and we have defined the reduced Planck 
mass mpi = l/v / 8vrG. 

We will assume that the space-time is spatially homogeneous, isotropic and flat i.e. a 
Friedmann- Robertson- Walker universe. The space-time interval can be then written as 

ds 2 = a(r]) 2 (dr) 2 - dx 2 ), (2.3) 

where a{rf) is the scale parameter and we have used conformal time r/ which makes the 
Hamiltonian equations of motion simpler compared to physical time. From this metric it 
follows that the determinant of the metric tensor equals 

V=9 = a 4 (2.4) 

and the Ricci scalar reads f/ 

ft = -6^, (2.5) 
ar 

where prime denotes differentiation with respect to conformal time. 
The action now reads 



U'2 



S = j ^- 3a' 2 m 2 Pl V L (dxf + J ( £ g_ + _ v(fc, ^))a 4 d 3 xj dr,, (2.6) 

where we have done partial integration with respect to time for the scale parameter term 
and also the integrated over the spatial coordinates. Note that V/^da;) 3 equals the volume 
of the periodic lattice in which the fields are embedded and V is the gradient operator with 
respect to comoving coordinates. In order to optimize the GPU implementation we have also 
done a spatial partial integration to the gradient terms of the scalar field which was also used 
in CUDAEASY [32]. This way the computationally difficult (Vcj)) 2 terms can be eliminated 
from the evolution equations. 

Next step is to make the spatial discretization of the lattice where the system in trans- 
formed into i x n 3 time dependent scalar fields that are coupled to each other. In the dis- 
cretization we have used the second order accurate stencils for the gradient terms presented 
in [36] which were also used in DEFROST [30], CUDAEASY [32] and in the symplectic 
method of Frolov and Huang [31]. In this notation the Laplacian operator reads 

x+i y+l z+i 

D[4>](x,y,z) = ^2 c d(i,j,k)4>(i,j,k) = ^2c d(a) 4>(a), (2.7) 

i=x— 1 j=y— 1 k=z— 1 a 



- 3 - 



where c^ a ) are the discretization coefficients of the Laplacian [36] . We also have 

G[4>] (x, V,z) = ^^2 c d(a) (</>(&) - 4>{x, y, z)) (21 



2 

a 

for (V(/>) 2 terms used in the energy density calculations. Huang has also implemented a 
fourth order accurate stencil in HLATTICE program [34]. Although the implementation of 
this in CUDA is possible it would however demand rewriting of the current CUDA code and 
this is left for future upgrades of PyCOOL. 

With the spatial discretization integrals transform to summations over the grid 



I d 3 x dx3 ( 2 - 

x,y,z 



and the action (2.6) reads 



S=(dxf J 

=(dx) 3 J C[a,(pi, ...4> N ]drj, 



„ r> f - 2 V" 2(\-(^x <l>i,xD[4>i x}{%) \ 2 , -, 

3a V L m Pl + 2^ a { Z_> { ~T~ H ^T~2 ) ~ a V \ ( t ) ^, 4>N,x) 



2dx 2 



d?/ 



(2-10) 

where we have labeled the different scalar fields with the index i and their location in the 
lattice x. 

In order to determine the Hamiltonian % the canonical momentum variables have to be 
determined. From the Lagrangian C we get for the scale parameter 

dC 

Pa = dd = - 6a ' V L m Pi ( 2 - n ) 

and for the field variables 

dC 

vr^= — = aV M , (2.12) 

After a simple Legendre transformation of the Lagrangian function the Hamiltonian function 
of the system equals 

n = -Y2V^ l + ^ a [tofi- 2a 2 dx 2 +V ^..,^)y (2.13) 

The Hamiltonian equations related to this Hamiltonian now read for the scale parameter 
, dl-L p a 



a 



dp a 6V L m 2 f 



"pi 

&H ^ J Ax <t>i$D\<j> iS \(x) , A ( 2A ^ 

A = -^ = L a t + ~^dt 2 W ^ fe) ' 

i,x \ / 

Similarly the equations of motion of scalar field i at grid point z read 

3U TTi 



hZ d( HB ) a 2 



on iDfcdW 4 9V 

7T,- =• = — 777 ; — r = a — -p, a 



(2.15) 



d((f>ig) dx 2 d(4>i 



-A- 



which follow from equation (2.13) by differentiating under the summation sign and by sum- 
ming over the coefficients c^a) ■ 

By noting that the total energy density and pressure of the scalar fields are given by 
the usual formulas 

» = £(^ + ^*> 2 ) + ^-> 
^£(^-^(v*f)-^.> 

i 

and by remembering that in the periodic lattice as a result of the Stoke's theorem 

\- <t>i,xD[(f>i,x](x) _ \^ g[gM(g) / 9ir , 

a: x 

it can be easily seen that the right hand side of the Hamiltonian (2.13) is actually a discretized 
version of the first Friedmann equation 

(a? = ^4-<p>, (2.18) 
3m Pi 

where () denotes average over the volume of the lattice. Similarly it can be easily seen that 
the equation of motion of p a is a discretized version of the second Friedmann equation 

a" = --4-(p-3P>. (2.19) 

In the curvaton scenario and during curvaton decay a homogeneous radiation component 
originating from the inflaton fields is included often in the system. We have therefore modified 
the Hamiltonian (2.13) into form 

(2.20) 

where we have now incorporated homogeneous radiation and non-relativistic matter compo- 
nents into the system. Since the scale parameters in the radiation component now cancel out 
the effect of radiation is only seen in the value of T~L but it does not show in the equations 
of motion of the system, which is consistent with the second Friedmann equation (2.19) for 
radiation. The homogeneous matter component leads to an additional ViPmfl term in the 
equation of motion of p a which is again consistent with the second Friedmann equation (2.19) 
for non-relativistic matter. 

2.2 Linearized equations 

Solving numerically the evolution generated by Hamiltonian (2.20) is computationally very 
heavy due to the number of equations involved, ixn 3 . However as noted in [45] the dynamics 
of the scalar fields (2.15) can be approximated in some cases with linearized equations which 
make the numerical calculations much faster. We will therefore derive also the Hamiltonian 
equations related to this linear case by first separating the field into a homogeneous and 
a non- homogeneous perturbation component. By noticing that the Fourier modes of these 
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linear perturbations do not couple to each other we can reduce the number of solved equations 
drastically and therefore solve the evolution faster. 

We decompose the scalar fields into a homogeneous and non-homogeneous perturbation 

part 

= + Sfoffl ( 2 - 21 ) 
where the sum of the perturbation terms over the lattice is assumed to vanish i.e. 

= 0. (2.22) 

X 

In the code this constraint is satisfied by making sure that the homogeneous k 2 s = mode 
of the perturbed variables stays close to zero at all times. The Lagrangian (2.10) 

C = -3a> 2 V L m Pl + £ a 2 ( £ - + a 2 V(4> h£ , fe)) (2.23) 

x i 

factorizes into 

C = - 3a' 2 V L m 2 Pl + a 2 V L ( £ 1 + a 3 V(<h,i, < 



) 0,A r , 



where 



and 



+ E fl2 (E (Vo/<^) +aV( 1 )(^ lilV .,M) (2.24) 

x i 

+ Z^ a V ^l^ 2^ J +a V >(<l>l,g,~,</>N,x) 

x i 



de 

V2l 



(2.25) 



(2) , . 1 crV(</>o,i + e^i(x), ...,0o,jv + eS(f) N (x)) 

V V >((t>l,x,-.,4>N,x) = g ^2 



(2.26) 

are the potential terms at first and second order in the perturbed variables, respectively. For 

\<4> 2 + h 



example for a simple polynomial potential function V = \m 2 ,(j) 2 + hg 2 <fi 2 x 2 t ne second order 



term reads 

V (2) ((f>, X ) = + \g 2 btfxl + 2g 2 ^xoH5x + (2-27) 

It is now easy to write the Lagrangians related to different orders of the field perturba- 
tions. The background Lagrangian simply reads 

£(0) = -3a' 2 V L m 2 Pl + a 2 V L ^l(^y + a2v (^,i,..-,M). (2.28) 

i 

At first order the Lagrangian term can be simplified by partial integrating a 2 (p' i b<\J i - term in 
the corresponding action integral and by writing the potential term (2.25) open after which 
the Lagrangian is 

= E ( - 2aa '^ - + ^m^) 5 ^- (2 - 29) 
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By noticing that the terms inside the brackets equal the Euler-Lagrange equation of motion 
of the background field i the first order Lagrangian terms equal zero. At second order the 
Lagrangian can be easily read directly from (2.24) 

^ =£° 2 (£ - ^gP) +«V 2 >(0,......fe)). (2.30) 

x i 

In order to derive the Hamiltonian equations of motion the necessary canonical momen- 
tums have to be first defined. For the scale parameter we get 

dC 

Pa = da/ = ~ Qa ' VLm Pi (2.31) 
and for the background field variables 

no,i = = a a ^ >4 . (2.32) 



and finally for the field perturbations 



By doing again a Legendre transformation we get the Hamiltonian function of the 
perturbed system 

/, 2 1 \ ( 2 - 34 ) 
. V-v 4 d7T i,x <%af (£) T ,(2)/, , x\ 

+ L« [-Mr 2w — + y { \^-^)\. 

i,x \ / 

In the equations of motion corresponding to this Hamiltonian we will now assume that the 
perturbation part is negligible compared to Ti^ and does not affect the dynamics of 
the scale factor or the homogeneous field equations. This way the background equations can 
be solved independently of the perturbations. 

With this reasoning we can write the Hamiltonian equations 



, 9U p a 

a 



dp a 6V L m 2 pl 
dU ^„ ,/ir, 



2 



da 

and for the homogeneous scalar field i 



(2.35) 



, on 4 dV 

*0,i = - af , \ = ~ a 



(2.36) 



0(00,0 0(^0,0 
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and the perturbation evolution is given by 

dU Sir* 



d(Smf) a 2 



6tt' 



d{5<t>ij) dx 2 d(8</>i,z)' 

Since these perturbation equations are linear we have chosen to evolve these in Fourier space. 
The corresponding equations read 



l.k 



5U. r 

i,k 



Oil. r = r~s a > 



where — /c 2 ff is the wave number corresponding to the discrete Laplacian operator D[c/)](z), T() 
is the discrete Fourier transform operator and the term inside the brackets can be easily read 
from equation (2.26). Note that inside the lattice there are multiple positions at which k 2 s 
terms get identical values and hence the perturbation equations of field i are identical at these 
positions and any differences in their evolved values originate from different initial values. 
We use this observation to reduce the number of solved differential equations significantly. 

2.3 Symplectic integration 

After constructing the discretized Hamiltonian functions (2.20) and (2.34) the next task 
is to select a suitable integration method. Remembering that the right hand side of the 
Hamiltonian corresponds to the first Friedmann equation (2.18) a natural selection is an 
integrator that preserves the value of the Hamiltonian leading to a solution consistent with 
the Friedmann equations. 

Symplectic integrators are a group of integrators that have this property and are there- 
fore ideal for the task. A further simplification comes from the observation that the Hamil- 
tonian functions (2.20) and (2.34) can be split into parts that can be integrated explicitly. 
By composing these sub-integrators suitably together higher order integrators of the whole 
system can be constructed. A good introduction to the subject of different splitting methods 
and geometric integrators can be found for example in [37]. 

In general the equations of motion of a Hamiltonian system with canonical variable and 
momentum z = (q,p) can be written in the form 

z' = { z ,U} = V n z, (2.39) 

where we have introduced the operator T>u = {,%} and {,} is the Poisson bracket. The 
formal solution of this equation reads 

z( V ) = exp{ V V n )z{0). (2.40) 

Assuming that the Hamiltonian can now be split into two integrable parts Hi, i = 1,2 
the task is to find a suitable composition of the operators exp(ryP-^ 1 ) and exp(r/2?-^ 2 ) that 
approximates equation (2.40) up to some order. More formally a set of real coefficients c\ 
and di need to be determined in the following equality 

k 

exp (ri(V Hl +V H2 )) = II ex P (ci^i) ex P ( d iWn 2 ) + 0(r, n+1 ). (2.41) 
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This can be done for example with the help of the famous Baker-Campbell-Hausdorff for- 
mula. See [38] for further details. For example a second order accurate symplectic integrator 
corresponds to values c\ = 1/2, d% = 1 and C2 = 1/2 and reads 

$( 2 )(r?) = exp exp (rfDnJ exp (J^) • (2.42) 

Second-order integrator however might not be precise enough for all applications and 
higher order methods are needed. One way to proceed is to compose higher order integrators 
from a second order one by recursion [38] 

$(2n+2)^ = $(2n)^ ir? ) $ (2n)^ o77 ) $ (2n)^ i?7 ) ) ^43) 
where the different coefficients equal 

1 

22n + l 

zo = J — 



2_22(2n+l) , 2U , 
1 

zi = — . 

2 - 2 2 "+i 

The downside of this method is that the number of times <3?2 (v) i s evaluated by an integrator 
of order 2k is 3 fc_1 + 1 and hence grows exponentially. We have therefore used this method 
only at 4th order integrations. 

A remedy to this problem [38] is to consider symmetric integrators composed of second- 
order symplectic maps 

$W( 7? ) = <Z> 2 ( Ws r])®2(w s-irj) ■ ■ • ^(^^(unr/), (2.45) 

where the coefficients satisfy a symmetricity condition w s +i-i = Wi and s is odd. This 
way the number of steps needed can be reduced dramatically: at 8th order it is possible to 
compose an integrator with 15 evaluations of $2(7?) compared to 28 iterations needed with 
the recursive method (2.43). In Table 1 we have listed the values of Wi for i = 1, (s + l)/2 
used in PyCOOL taken from [38, 39]. Note that the rest of the values follow from symmetry. 
A suitable splitting of the Hamiltonian (2.20) derived previously is 

Ui - p2a 



12V L m 2 pi ' 


i,x 




i.x 


( 4>i. 



2a?dx 2 



4 ( VLP-y.O , VlPmfi 



+ Vfag, ...,(j>N,x) 



since now the scale factor is evolved by Hi, scalar fields by the second Hamiltonian and the 
canonical momentums of the fields in the last part. The canonical momentum of scale factor 
is evolved in %2 and The Hamiltonian equations related to these can be integrated 
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explicitly to give flows 



nsr 9 dfj, p a , (f>ig, TT^g J, 

6V L m z pi I 



^n 3 ( d v) ■ a,Pa,(t>i,x,ni,x-\ ^ \a,Pa + 



£ 

i,x 

E 



-.3 / hx 



dr],4>i,x + 7rt ^dri,7r i) s 



i 2 dx 2 



dr), fag, 



Ki,x + 



■ D[<j>i,x](x) 

dx 2 



t 

d(<f>i,x) 



drj 



(2.47) 

By writing H = T-Ci + H2 = Hi + H2 + %3 and by performing the splitting used in equation 
(2.42) twice a second order integrator scheme of the scalar field system reads 



$%\dri) = $ Hl 



2 " 2 2 tt3V /; tt2 V 2 / ~' L1 V 2 



(2.48) 



Higher order integrators can now be easily constructed from this expression with the methods 
presented above. 

A splitting of the linearized Hamiltonian (2.34) is similar to the non-linear case. However 



Order 


Stages 


Coefficients 


6 


7 


wi = 0.78451361047755726382 
w 2 = 0.23557321335935813368 
w 3 = -1.17767998417887100695 
1U4 = 1 — 2(wi +w 2 + w 3 ) 


8 


15 


wi = 0.74167036435061295345 
W2 = -0.40910082580003159400 
w 3 = 0.19075471029623837995 
W4 = -0.57386247111608226666 
w 5 = 0.29906418130365592384 
w 6 = 0.33462491824529818378 
w 7 = 0.31529309239676659663 

w 8 = 1 - 2 Yl=l w i 



Table 1. List of the used integrator coefficients in PyCOOL. 
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now the perturbation part gives two additional Hamiltonians: 
Hi 



Pa 



2 ' 



12V L m 2 pi 



4 f VlP-yfi VLPmfl 

A ~l Q 



(2.49) 



By writing H = Hi + H2 = • • • = %i + %2 + H3 + ^4 + ^5 and by using equation (2.42) a 
second order integrator scheme of the perturbed system reads 



$%\dr,) =* Hl 



$n 4 



dij 
T 
drj 



drj 

T 

drj 

T 



$H 2 



*W (y «.(•''/ )- 



f ) **, (f 



(2.50) 



The order of this integrator can be verified for example with Huang's Python script that is 
available at http://www.cita.utoronto.ca/ zqhuang/work/symp6.py. 

When integrating the perturbation equations we use a similar method that was given 
in [45] to reduce the number of the integrated differential equations. In general the solution 
of a linear differential equation group can be written as 



/ 01 (r, keg) \ 
7Ti(t, k eS ) 

4>N(T,k eS ) 
\tT N (t, k eS )/ 



M(T, k 



/>l(0,ftaff)\ 
7Tl(0, k cS ) 

<f> N (0,k e{ [) 

\-K N (0, k eS )/ 



(2.51) 



where M(r, k e g) is an n x n matrix and we have written explicitly the effective mode de- 
pendence of the equations. We now use this equation to evolve the field perturbations with 
the following method: by remembering that in the lattice the effective wave number k^ s 
will get equal values at a number of different locations we will first calculate the distinct 
values of k^ s and then evolve the growth functions of a given field related to the modes 
M(t, fcg ff )l.j, where lj = (0, . . . , 1, . . . , 0), i.e. a vector with one at position i. We then write 
the right hand side of equation (2.51) as a linear combination of the different initial values 
4>i(0,k e s) and 7Tj(0,/c e ff) and use the solved growth functions to evolve the Fourier modes of 
the perturbations. 

2.4 PyCUDA implementation 

The symplectic integrators presented above have to be naturally also implemented in some 
computer language. We have chosen to use a combination of Python and Nvidia's CUDA 
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GPGPU language to make a program that is easy to adapt to different models of preheating 
but is also computationally very fast. The use of Python generally leads to some overhead 
compared to a pure C-language program and hence a slower program but the easy of use of 
the program more than makes up this deficiency. 

The program uses object-oriented programming extensively: the lattice, the potential 
function, the simulation and the fields are defined as python objects whereas the symplectic 
integration functions are implemented as subroutines of a python object that holds all the 
CUD A functions used in the integration steps. The program also uses extensively textual 
templating to write the necessary CUDA codes. After the end user has written a model 
file that defines the necessary fields, potential functions and the initial values the program 
will read these variables and values and write the necessary CUDA codes from predefined 
template files. This means that in ideal situations the end user only has to write the potential 
functions of the fields as a list of strings that PyCOOL then implements in suitable CUDA 
form with Python's SymPy package and custom string formatting codes written for the 
program. The final CUDA kernels are written for debugging and verification purposes into 
a separate folder where they can be easily inspected for any errors. 

The initialization of the different fields is done as in Defrost with a convolution based 
method. The details of this method can be found in [30]. These routines are included in a 
separate file that can be easily modified to include additional initialization algorithms. 

The implementation of the non-linear symplectic integrator (2.47) on GPU is similar to 
the one used in CUDAEASY [32] with some major upgrades. We will use similar compu- 
tational labor division as in CUDAEASY where the scalar fields were evolved on the GPU 
whereas the scale factor evolution was calculated with the CPU. The main differences come 
from the use of the very precise symplectic algorithms compared to a second order leap- 
frog algorithm and the use of double precision calculations. We have also optimized the 
CUDA kernels to use fewer registers compared to CUDAEASY which means that in theory 
more thread blocks can be calculated simultaneously. However the use of double precision 
arithmetics roughly doubles the number of the used registers which somewhat negates these 
improvements. 

We will briefly go through the steps involved in the GPU evolution. Further details 
can be found in the CUDAEASY paper [32] where the used CUDA terms were also defined. 
CUDA implementation of 3D stencils needed in the H3 part was presented in detail in [40]. 
We will follow a similar method that uses shared memory to calculate the Laplacian operators. 
Because of the limited amount of shared memory of the GPU the 3 dimensional lattice has to 
be divided into smaller pieces. To accomplish this we have sliced the lattice along the z-axis 
into smaller tiles and advance these tiles in z-direction. The computations within these tiles 
are done by CUDA thread blocks. This means that a single thread of a thread block will 
advance all the scalar fields of a column with constant x- and y-coordinates. Since the outer 
threads of a thread block need the values of scalar fields that are in a different block and since 
different blocks can only communicate through global memory the role of the outer threads 
of a thread block is only to load data into shared memory and the scalar field computations 
are done by the inner threads. The computation starts from the bottom of the lattice i.e. at 
z = and proceeds to the top. 

The integration of fag does not include any stencil operations that makes this step 
computationally much easier. Summations over the lattice that are needed when updating 
p a are performed also in the GPU and are done simultaneously with the evolution steps of 
fa t $ and 7Ti#. The energy density and pressure of the scalar fields i.e. equations (2.16) are 
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also calculated on the GPU. However since the symplectic evolution equations (2.47) do not 
explicitly depend on these variables they are not calculated at every step. This way the 
number of times the computationally demanding G[(f>i g] terms given in equation (2.8) need 
to be calculated can be reduced significantly. How frequently to calculate the energy densities 
and to write them to a file is decided by the user. 

The linearized code is somewhat different and we will briefly explain also the main steps 
involved. The main difference is that we now evolve all of the equations on the GPU. The 
used CUDA grid is chosen to be such that every thread calculates the evolution of all of the 
growth functions M{t, that have the same value of foL, i.e. there is a CUDA thread 

per a distinct value of fcJL. The evolution of all the background variables is done at the 
CUDA block level meaning that the scale factor and the canonical momentum p a are evolved 
by the first thread of a block. These values are stored in the shared memory which is visible 
to every thread in a block. Compared to the non-linear code where scale factor was evolved 
by the CPU this way the memory transfers between GPU and the CPU can be eliminated 
which results in a further speedup of the code. This way different thread blocks don't have 
to be synchronous meaning that no thread block idles during the calculation steps. After the 
growth functions have been integrated a given number of steps forward we use a CUDA kernel 
to read the growth functions related to a position in Fourier space in the lattice and then 
evolve the perturbation mode with equation (2.51). Overall this method leads to roughly 15 
times faster code compared to a naive version where every perturbation field mode is evolved 
independently. 

Note that the linearization part has been tested only with polynomial potential models 
and in general the use of the non-linear integrator is preferred. Hopefully in future releases 
the linearized code is expanded further and improved. 

The results of the program are written intermittently into a SILO file that can be easily 
plotted with Visit program provided by LLNL. After the simulation of dynamics is done we 
have included a post processing phase which calculates a number of different spectra. These 
include field, number density and energy density spectra that are used in LATTICEEASY 
program [29] and the field spectrum algorithm used in Defrost [30]. These post processing 
functions have been coded in Cython in order to make these computations much faster 
compared to ordinary Python implementation. The results are written as curves in the SILO 
file which can be easily plotted in the Visit program. 

It is also possible to perform non-gaussianity and curvature perturbation related calcu- 
lations with PyCOOL. This is done by using the AN formalism and the separate universe 
approximation [42-44] when solving the evolution of the system with different initial values. 
We will not however apply this functionality in this paper as it will be done in a follow-up 
paper where we study the curvaton preheating in closer detail. 

3 Numerical results 

We have tested the functionality of the program with three different scalar field models. We 
will use units where the reduced Planck mass m p i = {&TtG)~ l l 2 is set to one in all of the 
simulations. Note also that only the model file that defines the necessary fields, potential 
functions and the initial values changes from one simulation to other and the actual evolution 
code stays unchanged. 

We use the absolute value of the relative residual curvature 
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to measure the conservation of Hamiltonian in all of the examples. 

We have included plots of some of the spectra and post-processing data generated by 
the program in this section. For further examples of the output generated by the program 
see the home page http://www.physics.utu.fi/tiedostot/theory/particlecosmology/pycool/. 

3.1 Chaotic inflation model 

We have first tested the program by solving the evolution of scalar fields in the simple chaotic 
inflation model which has become a standard test scenario studied with LATTICEEASY [35] , 
DEFROST [30], CUDAEASY [32] and with PSpectRe [33]. The potential in this case is 

^i) = imV + ^W, (3-2) 

where <f) is the inflaton and if) the decay product. 

The initial values have been set to coincide with the values given in [30, 35]. This means 
that homogeneous value of the inflaton is set to <f> ~ 1.009343 and the decay field if) = 0, the 
mass of the inflaton equals 5 • 10 _6 m p i, the Hubble parameter H ~ 0.50467m, and the value 
of the coupling constant g = 100m.. We have evolved the system with conformal time step 
drj = 0.0001/m and lattice size 256 3 until t p h ys = 250/m. 

We have done the simulations with two different systems for comparison. The more 
expensive computer has a hexa core AMD processor, 16 GB of memory and an NVIDIA 
Tesla C2050 compute card whereas the much cheaper consumer computer has a quad core 
processor and an NVIDIA GTX 470 graphics card. Since NVIDIA has limited the double 
precision speed of the Fermi consumer graphics cards to one quarter of the Tesla series we 
can also test what kind of difference this makes in the simulation time. We use thread blocks 
that contain 324 = (18 2 ) threads of which computations are done by the inner 256 threads. 
This configuration was found to be the fastest of 64, 128 or 256 inner threads per block. 
The CUDA grid size is based on the size of lattice and for a 256 3 lattice we use 256(= 16 2 ) 
thread blocks. Note that these simulations were run with identical seed values in the random 
number generators meaning that the initial random field values were also identical. 

The numerical results for fourth order integrator are shown in figure 1 (a) where we have 
plotted a- z / 2 H- 1 ELS 3j function of the scale parameter. Note that in a matter dominated 
universe (H 2 oc o -3 ) this should be a constant. In figure 1(b) is the plot of the absolute value 
of the relative residual curvature (3.1). Compared to the accuracy of Defrost and PSpectRe 
presented in [33] PyCOOL is more than 5 orders of magnitude more precise. Note however 
that by integrating in conformal time the step sizes increase in physical time as the scale 
factor increases. This is also what most likely causes the numerical error to increase in time. 
By using an adaptive conformal step with constant increases in physical time the numerical 
error will stop to increase. This will however lead to roughly 20 % longer simulation times. 
Another possibility would be to solve the evolution of this model in physical time as is done 
usually [30, 35]. We intend to add this possibility in a future version of PyCOOL. 

We have included the energy density spectra of the scalar fields in figures 1(c) and 
1(d). In these figures it can be seen that the preheating process excites modes of the if) field 
with k/m < 20. These will eventually couple with the inflaton field leading to the phase of 
non- linear of dynamics of the system [35] . 

The total computation for one step when using the fourth order integrator is on average 
0.132 seconds when using a Tesla card which corresponds to a total running time of roughly 
eight hours for one simulation with 256 3 elements. The GTX 470 card is roughly 15 % slower 
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(c) (d) 

Figure 1. (a) The evolution l/(a 3 ^ 2 H) as a function of the scale parameter in the chaotic potential 
model, (b) The absolute value of the relative residual curvature (3.1) as a function of physical time in 
units of 1/m. The energy density spectrum of (c) <f> field and (d) ip field. Note that in these spectra 
the color evolves with time and the lightest curve (at the top) is calculated at < p h ys = 250/to whereas 
darkest one (at the bottom) are evaluated at t p h ys = 0/m. Please see the online version of this article 
for color figures. 

integrating one step in 0.152 seconds. However by remembering that the double precision 
speed of the Tesla card is four times that of GTX 470 this difference can be seen to be minimal. 
When taking the rather high price of the Tesla cards also into account the consumer cards 
seem to offer a better choice for simple lattice calculations even at double precision accuracy. 

When comparing these results to LATTICEEASY, Defrost and it should be remembered 
that we are using a fourth order accurate integrator. Lowering the order of the integrator 
to two which is also used in the aforementioned programs would lead to roughly three times 
faster computations but this would naturally lead to a lower accuracy of the simulations. A 
similar simulation as was used in Defrost with constant physical time steps and a second order 
integrator would take roughly three hours. For comparison a Defrost simulation execution 
speed is 0.32 s per step on quad core Xeon machine meaning that when using equal integrators 
PyCOOL is roughly six times faster. We also tested the speed of PSpectre on the faster test 
machine with a hexa core AMD processor. A test run showed that a 256 3 simulation takes 
roughly 5.1 seconds per step meaning that PyCOOL is roughly 115 times faster when using 
integrators of equal order. 
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3.2 Oscillons 



Oscillons are pseudo-stable, non-topological solitons that can form in the post-inflationary 
universe [47, 48] . These have been studied numerically lately in [49] for one field case and 
in [50] for two field oscillons. The generation of these oscillons leads to a matter dominated 
phase after resonance which differs from the relativistic equation of state observed in the 
chaotic inflation after preheating. 

We will simulate the oscillon production with a model that was used in [49]. The 
potential function is set to be 

V&O = - \^ + £-y (3.3) 

where A > and (X/g) 2 <C 1 are necessary conditions for so called flat-top oscillons [49]. 
Note that this potential is assumed to be valid only in the post-inflationary universe. For 
the parameters and initial values we use the following values: m = 5 • 10 -6 m p i, A = 2.8125 • 
10~ 6 , g 2 = A 2 /0.1. We also set the homogeneous value of the oscillon field to be tpo = 
\j (3A) / (5<7 2 ) m and the canonical momentum ttq = 10~ 16 m. The lattice side length was set 
to L = 400/m and dr] = 0.005/m. We will use a lattice of 128 3 points whereas the simulations 
in [49] were done with at least 256 3 points. Note that an additional post processing function 
was used in [49] to detect the the oscillons from the data. Implementation of this function 
in PyCOOL is beyond the scope of this paper but could be included in a future version of 
the code. 

We have plotted the evolution of a~^/ 2 H~ l term as a function of the scale parameter in 
figure 2(a). In figure 2(b) we the plot of the absolute value of the relative residual curvature 
(3.1) which use again as a proxy for numerical accuracy. Compared to the results of chaotic 
potential simulations the algorithm is now less precise. We are however using a smaller lattice 
which will cause some of this error and a larger time step which is naturally another source 
for the observed inaccuracy. 

We have also plotted the number density spectrum of oscillons for illustration in figure 
2(c) with different shades of orange. The darkest curves correspond to initial values whereas 
the brightest curves are calculated close to the end of the simulation. From the spectrum it 
can be seen that the fragmentation of the oscillon field happens initially close to k ~ 0.1/m 
band in the momentum space. However as the system evolves other modes are excited as 
well and the peak close to k ~ 0.1/m vanishes. We have also included a snapshot of the 
energy density of the system in figure 2(d) at t p h ys = 9821. 88/m. The system is clearly seen 
to fragment into blobs of scalar matter. 

These simulations were done with the Tesla card in 5 hours and 55 minutes leading to 
an average time of 0.0109 seconds per one step. The same conclusions apply as previously: 
lowering the order of the integrator to two leads to three times faster simulations whereas 
increasing the size of the side of the lattice by two leads to eight times longer simulation 
times. 

3.3 Curvaton model with Q-balls 

The curvaton mechanism is a well studied alternative to the usual inflation scenario [51]. 
The main idea is that the curvaton is a light scalar field that stays sub-dominant during the 
inflation process and does not contribute to the expansion of the space. After the inflation 
has ended the curvaton will eventually become massive compared to the Hubble rate H and 
it will start to oscillate and eventually decay into relativistic particles. 



- 16 - 



|K|/(aW) 
1 n _tJ r 




t m 



(c) (d) 

Figure 2. (a) The absolute value of the relative residual curvature (3.1) as a function of physical time 
in units of 1/m. (b) The equations of state of the field and a moving average of 50 steps over the data 
points used. The equation of state is clearly close to a non-relativistic matter. The number density 
spectrum of (c) <p field. Note that the color evolves with time and the lightest curve (at the top) 
is calculated at i p hy S = 250 /m whereas darkest one (at the bottom) are evaluated at t p hy S = 0/m. 
In (d) we have the volume plot with isosurfaces of the energy density of the system calculated at 
^phys = 9821. 88/m. Please see the online version of this article for color figures. 

Supersymmetric versions of the curvaton scenario have been studied in [59-61]. In 
supersymmetric theories there are a number of directions in the field space in which the 
potential function vanishes called flat directions [58]. The Affleck-Dine (AD) mechanism 
[52] uses these flat directions for the baryogenesis. It was noted in [53, 55] that the AD 
mechanism can lead to the production of a large number of clumpy scalar field objects called 
Q-balls. These are a class of non-topological solitons that are frequently generated in different 
supersymmetric theories of the early universe and have been since studied numerically in two 
and three dimensional simulations [54-57]. 

We will now assume that the Affleck-Dine (AD) field plays the role of the curvaton. 
Although this model might not be consistent with current cosmological observations [61] we 
will still use it to study the functionality of the algorithm. The potential function in the 
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gravity-mediated SUSY breaking model is 



V($) = m a m 2 (l + Klog(^^ - gH 2 \^\ 2 + -^^f (3.4) 

where $ is the complex curvaton field and the curvaton mass m a , M* is the renormalization 
point, g a constant of order one and mpi the reduced Planck mass. We will use very similar 
initial and parameter values that were employed in [56], namely m a = 10 2 GeV, M* = 10 14 
GeV, K = —0.1, A = 0.5. Note that the gH 2 -teim can be omitted since gH 2 <C m, 2 . The 
initial values for the fields are set to be 

$(i P hys,o) = 2.57 • lOVx, ^(t phy s,o) = 2.57 • 10 7 m 2 i (3.5) 

that were used in [55]. We will assume that the universe is dominated by radiation that 
originates from the inflaton decay. Initial time is set to t p h ys = 100/m CT and the energy 
density of the homogeneous radiation component is initially p 7 = 4/(3i 2 hys ) which sets the 
initial Hubble parameter value equal to the one used in [56]. The initial inhomogeneous 
perturbations are created with the method given in Defrost. We use a 128 3 lattice with side 
length set to L = 8/m a and time step dr\ = 0.005/m CT . 

We have plotted the absolute value of the relative residual curvature in figure 3(a). The 
algorithm is seen to be accurate even with a fourth order integrator. We have also included 
a plot of energy density spectrum of the real part of the field in figure 3(c). The generation 
of Q-balls can be seen as a clear peak in the infrared modes during the early part of the 
simulation. In figures 3(c) and 3(d) a snapshot of the energy density of the system with 
isosurface plots is given at t p h ys = 645.345/mCT and i p h ys = 5088.03/m<7 respectively. Further 
study of this system would necessitate the use of post-processing algorithm that identifies the 
Q-balls inside the lattice [57]. This procedure could be also implemented in CUDA language 
and included with the program but we have however decided to leave this as a possible future 
upgrade of the program. 

These simulations were done with the Tesla card. One simulation takes roughly 2 hours 
and 33 minutes meaning an average time of 0.032 seconds per one step. Note that due to the 
more computationally demanding potential function the code is roughly three times slower 
than in the oscillon case. A quick test shows also that the consumer card is roughly 60 % 
slower than the Tesla version which is mainly due to the difference in computational speed. 
Simulation of a larger 256 3 lattice would take roughly 20.5 hours with the Tesla card. 

4 Discussion and conclusions 

We have presented a PyCUDA implementation of symplectic integrator of scalar fields in 
an expanding space. Although the program has its roots in LATTICEEASY, Defrost and 
CUDAEASY programs it is based on an entirely new algorithm developed by A. Frolov and 
Z. Huang [31]. Due to its sympleticity the algorithm is able to conserve the first Fried- 
mann equation to very high precision. We have also further improved and optimized the 
used method to better suite the GPU. For simulations in which a linearized version of the 
equations suffice we have derived a linearized Hamiltonian equations and implemented the 
corresponding symplectic integrator also on the GPU. 

Most of the program has been written in Python language to make it as friendly as 
possible to the end user. Although Python generally leads to some overhead when compared 
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Figure 3. (a) Plot of the relative residual curvature (3.1) as a function of physical time in units of 
1/m. (b) The energy density spectrum of the real component of <£. Note that the color evolves with 
time and the lightest curve (at the top) is calculated at t p h ys = 10000/m whereas darkest one (at the 
bottom) are evaluated at i p hy S = 0/m. The generation of Q-balls can be seen as a spike in the small 
k values of the early spectra. Energy density isosurfaces calculated at (c) t p h ys = 645.345/to ct and at 
(d) iphys = 5088.03/mg. in the curvaton scenario. The curvaton field clearly fragments into Q-balls. 

to a pure C-language implementation and hence a longer runtime the easy of use of the 
program more than makes up this deficiency. When comparing PyCOOL to previous scalar 
field integrators the cost of changing to a different scalar field model has been made much 
lower. This is based on the textual templating functionality used in PyCOOL (and made 
possible by PyCUDA and jinja2 library) that creates automatically the simulation code from 
template files meaning that the end user should only rarely need to edit the actual simulation 
code. In order to study a new model a simple model file that defines the necessary fields, 
potential functions and the initial values needs to be created and in most cases the program 
will handle the rest. 

We tested the functionality of PyCOOL with three different models that lead to non- 
linear dynamics. In all of these we were able to produce results consistent with previous 
studies. We also tested the accuracy of the program by studying the conservation of the first 
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Friedmann equation. We found that the program was not only able to keep the errors many 
orders of magnitude smaller compared to older lattice evolution codes but that it can also 
be orders of magnitude faster than a multithreaded CPU powered code. 

To make the program more useful to the end user we have included various postprocess- 
ing functions that can be used to improve the understanding of the non-linear process under 
study. We have illustrated this functionality with selected figures in the numerical results 
section. Further examples of the output generated by the program will be added to the home 
page of the program http:/ /www. physics. utu.fi/tiedostot/theory/particlecosmology/pycool/. 

Although PyCOOL has been done with the intention to make it as good as possible there 
are several venues were it could be improved. Currently the program uses conformal time 
in the integration steps. It would be however rather easy to implement also an integrator 
that uses physical time in the integration steps. This could be also expanded to make it 
possible to use a more general time variable similar to LATTICEEASY. Other possibilities 
include a possible pseudo-spectral version to calculate the Laplacian operators. Also with 
the advent of CUDA 4.0 software multi-GPU implementations have become much easier to 
manage. A possible multi-GPU version of PyCOOL could either solve very large lattices 
faster or alternatively it could be used to evolve many simulations simultaneously which 
could be used in non-gaussianity related calculations [41-43, 45] where a large number of 
lattices with different initial conditions needs to be simulated. A version that uses distributed 
computing to calculate non-gaussianity would also be an interesting possibility especially 
when remembering that the performance of the consumer graphics cards was very close to 
the Tesla cards in the chaotic potential model. 

In summary, we believe that PyCOOL has great potential to be very useful to the cos- 
mology and particle physics community when studying different non-linear post-inflationary 
processes with high precision and speed. 
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